[WIP] MAEB task selection #3867
base: maeb
Conversation
Implements a new task selection approach using correlation analysis and clustering for MAEB evaluation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <[email protected]>
- Add domain, category, and language checks to is_candidate_valid_removal to preserve at least one task from each unique domain, category, and language
- Add top 5 longest tasks display for CLAP model reference timing
- Add diagnostic cell for tasks with many negative correlations
- Expand correlation thresholds to include 0.8 and 0.9
- Add Languages, Domains, Categories columns to summary table
- Comment out license filtering to include all tasks
- Handle empty model coverage gracefully with fallback logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
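For readers without the notebook handy, here is a minimal sketch of the coverage check described in this commit. The helper name follows the commit message, but the task-object and metadata shapes are assumptions, not the notebook's actual code.

```python
def _metadata_values(task, attr):
    """Normalize a metadata attribute into a flat set of strings (assumed shapes)."""
    value = getattr(task.metadata, attr, None)
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    if isinstance(value, dict):  # e.g. eval_langs given as {subset: [language codes]}
        return {lang for langs in value.values() for lang in langs}
    return set(value)


def is_candidate_valid_removal(candidate, remaining_tasks):
    """A task may only be removed if every domain, category, and language it
    covers is still covered by at least one other remaining task."""
    others = [t for t in remaining_tasks if t is not candidate]
    for attr in ("domains", "category", "eval_langs"):
        still_covered = set()
        for t in others:
            still_covered |= _metadata_values(t, attr)
        if not _metadata_values(candidate, attr) <= still_covered:
            return False
    return True
```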
…ased tasks_to_keep
- Move UMAP+HDBSCAN clustering right after initial correlation matrix
- Define tasks_to_keep from outlier cluster (label -1) instead of empty list
- Split function definitions to break circular dependency
- Add domain counts cell after results DataFrame
- Add model coverage distribution analysis (models at each task count)
- Use models with >= 50 tasks for runtime estimation
- Show task coverage in runtime output (N/M tasks with eval times)

🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude <[email protected]>
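A rough sketch of the clustering step described above, assuming each task is represented by its row of the task-by-task correlation matrix; the function name and hyperparameters are illustrative only, not the notebook's actual settings.

```python
import hdbscan  # hdbscan package
import pandas as pd
import umap  # umap-learn package


def outlier_tasks_to_keep(corr: pd.DataFrame, random_state: int = 42) -> list[str]:
    """Embed tasks with UMAP, cluster with HDBSCAN, and protect tasks in the
    noise/outlier cluster (label -1), since they behave unlike any other task."""
    features = corr.fillna(0.0).to_numpy()  # one row of correlations per task
    embedding = umap.UMAP(
        n_neighbors=10, min_dist=0.0, random_state=random_state
    ).fit_transform(features)
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embedding)
    return [task for task, label in zip(corr.index, labels) if label == -1]
```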
- Add get_pairs_above_threshold helper to get all correlated pairs
- Track skipped_pairs where neither task can be removed
- Continue to next pair when current pair is protected
- Clear skipped_pairs when task set changes after removal
- Only stop when all pairs above threshold have been tried

🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude <[email protected]>
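The greedy loop described here could look roughly like the following. get_pairs_above_threshold follows the commit message, while prune_correlated_tasks and its can_remove callback (e.g. wrapping the coverage check sketched earlier) are assumed names, and the strongest-pair-first ordering is a guess.

```python
def get_pairs_above_threshold(corr, threshold):
    """All task pairs (a, b) with correlation above the threshold, strongest first."""
    pairs = [
        (corr.index[i], corr.columns[j], corr.iloc[i, j])
        for i in range(len(corr.index))
        for j in range(i + 1, len(corr.columns))
        if corr.iloc[i, j] > threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def prune_correlated_tasks(corr, threshold, tasks_to_keep, can_remove):
    """corr: square task-by-task correlation DataFrame.
    can_remove(name, remaining_names): validity check for removing a task."""
    remaining = list(corr.index)
    skipped_pairs = set()  # pairs where neither task could be removed
    while True:
        sub = corr.loc[remaining, remaining]
        pairs = [
            p for p in get_pairs_above_threshold(sub, threshold)
            if (p[0], p[1]) not in skipped_pairs
        ]
        if not pairs:  # every pair above the threshold has been tried
            break
        task_a, task_b, _ = pairs[0]
        removed = False
        for candidate in (task_a, task_b):
            if candidate not in tasks_to_keep and can_remove(candidate, remaining):
                remaining.remove(candidate)
                skipped_pairs.clear()  # task set changed: revisit protected pairs
                removed = True
                break
        if not removed:  # this pair is protected; continue to the next pair
            skipped_pairs.add((task_a, task_b))
    return remaining
```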
Visualizes results_df with:
- Blue gradient colormap (light to dark)
- White background for NaN values
- Adaptive text color (white for high scores, black for low)
- Dynamic figure sizing based on data dimensions

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
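A minimal matplotlib sketch of the heatmap described above; the results_df orientation, the score scale, and the mean-based text-colour switch are assumptions rather than the notebook's actual choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_results_heatmap(results_df: pd.DataFrame):
    data = results_df.to_numpy(dtype=float)
    # Figure size grows with the number of rows and columns
    fig, ax = plt.subplots(figsize=(0.6 * data.shape[1] + 4, 0.3 * data.shape[0] + 2))
    cmap = plt.cm.Blues.copy()
    cmap.set_bad("white")  # NaN cells are drawn white
    im = ax.imshow(np.ma.masked_invalid(data), cmap=cmap, aspect="auto")
    ax.set_xticks(range(data.shape[1]), results_df.columns, rotation=90)
    ax.set_yticks(range(data.shape[0]), results_df.index)
    threshold = np.nanmean(data)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            if not np.isnan(data[i, j]):
                colour = "white" if data[i, j] > threshold else "black"
                ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center",
                        color=colour, fontsize=6)
    fig.colorbar(im, ax=ax)
    return fig
```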
- Add MAEB(audio-text) benchmark with 17 cross-modal retrieval tasks (8 audio-to-text, 9 text-to-audio) selected via correlation threshold 0.95
- Inline task lists directly in MAEB benchmark objects
- Add threshold 0.95 to task selection notebook
- Convert comparison plot from 1x5 to 2x3 layout for 6 thresholds
- Fix tasks_to_select_from to use modality-filtered tasks
- Use models with complete eval times for runtime estimation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Expand MAEB(audio-text) benchmark from 17 to 29 tasks (14 A2T + 15 T2A)
- Fix msclap model revision from "N/A" to "no_revision" to match results cache
- Update benchmark contacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The script generates top 10 model rankings for the MAEB(audio) and MAEB(audio-text) benchmarks using Borda count, with per-category averages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
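A hedged sketch of the Borda-count ranking mentioned above, assuming a models x tasks score table; missing results simply earn no points. This is illustrative, not the script itself.

```python
import pandas as pd


def borda_top_models(scores: pd.DataFrame, top_n: int = 10) -> pd.Series:
    """scores: rows = models, columns = tasks, values = main scores (NaN allowed)."""
    n_models = len(scores)
    # Rank 1 = best score on each task; ties share the best (minimum) rank
    ranks = scores.rank(axis=0, ascending=False, method="min")
    points = (n_models - ranks).sum(axis=1, skipna=True)
    return points.sort_values(ascending=False).head(top_n)
```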
I generally like marimo, but damn this is not the easiest thing to review. This is one of the cases where you really need the results to know what is filtered and why (having to git pull and run it to see seems like a big drawback). Is it possible to convert it to an .ipynb or .md for the results?
Yeah, I can export a PDF or HTML or something?
Created an overview table for tasks and where they're used. Also a Google Sheets version: https://docs.google.com/spreadsheets/d/1wyTvW0q6TIat7RMmfimlNKXri9O7cs_S0uebGTNya0c/edit?usp=sharing

Script used to generate the table:

import mteb
import pandas as pd

tasks = mteb.get_tasks(modalities=["audio"])
audio_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio)")]
audio_text_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio-text)")]

row = []
for task in tasks:
    print(task.metadata.name)
    in_audio = task.metadata.name in audio_tasks_names
    in_audio_text = task.metadata.name in audio_text_tasks_names
    row.append(
        {
            "Task Name": task.metadata.name,
            "Task description": task.metadata.description,
            "Task type": task.metadata.type,
            "Task language(s)": ", ".join(task.metadata.eval_langs)
            if isinstance(task.metadata.eval_langs, list)
            else ", ".join(v[0] for v in task.metadata.eval_langs.values()),
            "In MAEB(audio)": "Yes" if in_audio else "No",
            "In MAEB(audio-text)": "Yes" if in_audio_text else "No",
        }
    )

df = pd.DataFrame(row)
df = df.sort_values(by=["Task Name", "Task type"]).reset_index(drop=True)
df.to_csv("audio_tasks_table.csv", index=False)
df.to_markdown("audio_tasks_table.md")
We could probably create an English-only version, but I'm not sure it is relevant, because most of the tasks are English-only.
Where are all the multilingual tasks?
I think we can create
But this might be complicated for users to understand.
Why would it be complicated? Seems clear to me.
Hmm, I would maybe do:
However, I would probably argue we could just make two columns that are
PS: We have to fix the language annotations; birdset, for example, is not English.
How should we name it? Just
For the leaderboard, I agree, but for users I'm not sure, because this can create problems at inference.
Ah, I get it now: only maintain MAEB. Do we bother filtering out similar tasks, or use the entire collection?
MAEB is the full Massive Audio Embedding Benchmark (v1), containing all tasks with audio modality across 7 task types: classification (35), clustering (10), pair classification (5), reranking (6), zero-shot classification (5), audio-to-text retrieval (18), and text-to-audio retrieval (17). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
I'm a bit afraid that if we use only one benchmark, users may want to evaluate on only part of it, e.g. audio only. They would then need to filter tasks themselves.
What if we have an English list, an audio list, and a "the rest of the collection" list, and MAEB is English + audio + "the rest"? We can still have MAEB(eng)v1, MAEB(audio)v1, and MAEBv1?
Rename UrbanSound8kZeroshotClassification to UrbanSound8kClassification in audio_classification module to avoid collision with the identically named class in audio_zeroshot_classification module. Both classes had the same Python name but different task names:
- audio_classification: task name "UrbanSound8k"
- audio_zeroshot_classification: task name "UrbanSound8kZeroshot"

The * imports caused the zeroshot version to overwrite the classification version, leaving only "UrbanSound8kZeroshot" registered in the task registry and breaking MAEB benchmarks that reference "UrbanSound8k".

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The dill/datasets library had a pickle incompatibility with Python 3.14. Datasets v4+ resolves this issue. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The v0.02 task class was defined but not exported in __init__.py, causing a KeyError when referenced in benchmarks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Renamed classes to match their metadata names so they can be found in the task registry:
- JamAltArtist → JamAltArtistA2ARetrieval
- JamAltLyricsT2A → JamAltLyricT2ARetrieval
- JamAltLyricsA2T → JamAltLyricA2TRetrieval

Also added explicit imports and exports for proper registration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed 2631fc8 to 411a4ce.
This reverts commit b244226.
Zero-shot classification tasks require text modality and are now only in MAEB(audio-text). MAEB(audio) now has 24 tasks across 4 task types. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
I resolved #3877 and removed zeroshot tasks from the audio-only benchmark.
Resolved conflict in any_2_any_retrieval/__init__.py, keeping correct class names:
- JamAltArtistA2ARetrieval
- JamAltLyricA2TRetrieval
- JamAltLyricT2ARetrieval

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Just added the MAEB audio extended and lite benchmarks. These have 54 tasks / 38 models and 19 tasks / 44 models respectively. MAEB audio-text lite has 30 tasks / 10 models. This is done by finding the largest number of tasks with the largest number of completed model eval runs. No filtering is applied. @AdnanElAssadi56 @KennethEnevoldsen @Samoed would love a quick pair of eyes on these. I'd say we can probably start with these and start filling in the relevant paper subsections.
- Audio, Extended
- Audio, Lite
- Audio-Text, Lite
Replace MAEB(audio) and MAEB(audio-text) with new benchmarks optimized for maximum model coverage:
- MAEB(audio, lite): 19 tasks, 44 models with complete results
- MAEB(audio, extended): 54 tasks, 38 models with complete results
- MAEB(audio-text, lite): 30 tasks, 10 models with complete results

Tasks selected via greedy algorithm maximizing models with all tasks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
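One plausible reconstruction of the greedy selection mentioned above (an assumption, not the actual notebook code): given a boolean models x tasks coverage matrix, repeatedly drop the task whose removal leaves the largest number of fully covered models, and read the lite/extended cut-offs from the resulting trade-off curve.

```python
import pandas as pd


def greedy_task_selection(has_result: pd.DataFrame) -> list[tuple[list[str], int]]:
    """has_result: rows = models, columns = tasks, True if an eval run exists.
    Returns (task_subset, n_models_with_complete_results) at each greedy step."""
    tasks = list(has_result.columns)

    def complete_models(cols: list[str]) -> int:
        # Number of models that have a result for every task in `cols`
        return int(has_result[cols].all(axis=1).sum())

    frontier = [(tasks.copy(), complete_models(tasks))]
    while len(tasks) > 1:
        # Drop the task whose removal maximizes the number of fully covered models
        to_drop = max(tasks, key=lambda t: complete_models([c for c in tasks if c != t]))
        tasks.remove(to_drop)
        frontier.append((tasks.copy(), complete_models(tasks)))
    return frontier
```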
The prerun loop was calling on_benchmark_select() and update_task_list(), which return gr.update() objects, but then passing those objects to functions expecting raw lists. This caused cache corruption and Gradio validation errors when switching between benchmarks with different task types (e.g., from MAEB(audio-text, lite) with Any2AnyRetrieval to MAEB(audio, lite) without it).

Fix by calling the underlying cached functions directly:
- _cache_on_benchmark_select() instead of on_benchmark_select()
- _cache_update_task_list() instead of update_task_list()

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Leaderboard fixes:
- Cancel pending filter events when benchmark changes to prevent race conditions with stale values
- Make _update_description derive counts from benchmark tasks directly instead of filter selections to avoid validation errors

Benchmark changes:
- Remove AudioCapsMiniReranking from MAEB, MAEB(audio, lite), and MAEB(audio, extended)
- Update task counts in descriptions (96→95, 19→18, 54→53)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Looks great!
- Use MAEB(audio, lite) and MAEB(audio-text, lite) benchmarks
- Table 1: Classification, PairClassification, Reranking, Clustering
- Table 2: Retrieval, ZeroshotClassification
- Make table functions accept task_names and benchmark_name parameters

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add alias mapping for task types that lose digits during column name processing (e.g., Any2AnyRetrieval -> AnyAnyRetrieval). Also add more audio models to annotation list. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Marimo notebook for analyzing evaluation times across MAEB benchmarks. Loads model metadata and task results to compare eval times between large and small models for audio and audio-text benchmarks. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Great work, @isaac-chung! When we say audio-text, lite here, are we implying an extended version to the readers?
It's the most complete collection based on what we have run. We're missing results for an extended version.
Resolve conflicts in pyproject.toml and uv.lock, taking maeb's version for speechbrain dependency constraint. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add MAEB(audio-text, extended) benchmark with 36 tasks:
  - All 30 tasks from the lite version
  - Clotho A2T/T2A for audio captioning
  - Fleurs A2T/T2A (102 languages)
  - CommonVoice 21 A2T/T2A (82+ languages)
- Refine MAEB(audio-text, lite) to 17 tasks:
  - Remove redundant A2T tasks that have T2A equivalents
  - Remove SpeechCommandsZeroshotv0.01 (keep only v0.02)
  - Keep 13 T2A retrieval + 4 zero-shot classification
- Add MAEB(audio-text, extended) to benchmark selector

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Added
New utility script that calculates total evaluation times for specified benchmarks and models. Features:
- Takes --benchmarks and --models as required arguments
- Optional --results-dir for custom cache location
- Outputs formatted table with task coverage and times per benchmark
- Shows totals per model

Usage:
python scripts/calculate_eval_times.py \
  -b "MAEB(audio-text, lite)" "MAEB(audio-text, extended)" \
  -m "OpenMuQ/MuQ-MuLan-large" "laion/clap-htsat-unfused" \
  -r /path/to/results

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Computes Spearman and Pearson correlations between MAEB lite and extended benchmark variants to validate that lite benchmarks preserve model rankings. Outputs correlation values and scatter plots (PNG and PDF). Co-Authored-By: Claude Opus 4.5 <[email protected]>
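For reference, the rank-agreement check could be as simple as the following sketch, assuming two pandas Series of per-model mean scores (one per benchmark variant); the variable names are placeholders, not the script's actual interface.

```python
from scipy.stats import pearsonr, spearmanr


def rank_agreement(lite_scores, extended_scores):
    """lite_scores / extended_scores: pd.Series of mean scores indexed by model name."""
    common = lite_scores.index.intersection(extended_scores.index)
    lite, ext = lite_scores.loc[common], extended_scores.loc[common]
    spearman, _ = spearmanr(lite, ext)
    pearson, _ = pearsonr(lite, ext)
    return spearman, pearson
```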
Resolve merge conflicts in audio task imports: - Update JamAlt and AudioCaps imports in any_2_any_retrieval - Remove moved files from eng classification imports Co-Authored-By: Claude Opus 4.5 <[email protected]>
@AdnanElAssadi56 @Samoed @KennethEnevoldsen I've updated both this branch AND the paper draft based on the following: MAEB Benchmark Summary
Notes:
The __init__.py was importing UrbanSound8kZeroshotClassification but the class is actually named UrbanSound8kClassification in the source file. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Great work! Maybe we can create versions with only English?
Thanks! I feel our recurring theme overall has been maintainability, and that drives us to keep the number of benchmarks low. As such, I feel a modality split is the only key factor that warrants separate benchmarks. This way, we can also make the claim that, since it's inherently multilingual, we incentivize/nudge the community to develop better multilingual audio embedding models. For English subsets, perhaps we only show a doc example on how to filter tasks?
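The doc example could be as small as this sketch: filter a MAEB benchmark down to English-only tasks client-side. The eval_langs handling mirrors the overview script earlier in this thread; exact metadata shapes may vary per task.

```python
import mteb


def is_english(task) -> bool:
    langs = task.metadata.eval_langs
    if isinstance(langs, dict):  # multilingual tasks: {subset: [language codes]}
        langs = [lang for subset in langs.values() for lang in subset]
    return all(lang.startswith("eng") for lang in langs)


benchmark = mteb.get_benchmark("MAEB(audio, lite)")
english_tasks = [task for task in benchmark.tasks if is_english(task)]
```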
I was thinking this would be the ideal behaviour. We could easily add code for that (we could even add the benchmarks) without adding multiple views in the leaderboard. I really agree with:
Which is why I would rather have a single benchmark with filters. I think this aligns fairly well with what we have now, though. This is how I would phrase it in the paper: we construct a broad range of tasks and call this collection MAEB+. This is the unreduced, extended set. The actual benchmark is then a condensed version of it (MAEB+ never becomes a released benchmark; it is just a collection of tasks used to construct MAEB). What do you guys think? I am unsure whether we want to keep audio and audio-text separated, though. Here I lean towards combining, but it is only a small preference (I will look more at the paper to figure out what is best).
I agree that default multilingual (and potentially default multimodal, audio-text?) is a good incentive to provide. People will be interested in the English column, but we can provide that. Questions:
(I will look more at the paper as well.)
Overall, I think we can include an English-only (or multilingual) benchmark without
What is the problem with maintainability?
Modality split still seems the most practical: there are just a lot more audio-only embedding models, and fewer audio-text-capable models.
💯 I think an English column and a
A high number of benchmarks lowers maintainability.



See the draft benchmarks. (For audio-text I actually use the full collection, no filtering.) You'll also find the filtering notebook and the script to generate "Table 1".
@KennethEnevoldsen @AdnanElAssadi56 maybe another one for environmental or something?